Abstract

A  popular  approach  to  significance  testing  proposes  to  decide  whether  the  given 
hypothesized  statistical  model  is  likely  to  be  true  (or  false).  Statistical  decision  theory 
provides  a  basis  for  this  approach  by  requiring  every  significance  test  to  make  a  decision 
about the truth of the hypothesis/model under consideration. Unfortunately, many interesting
and useful models are obviously false (that is, not exactly true) even before
considering any data. Fortunately, in practice a significance test need only gauge the
consistency (or inconsistency) of the observed data with the assumed hypothesis/model,
without enquiring as to whether the assumption is likely to be true (or false), or
whether some alternative is likely to be true (or false). In this practical formulation, a
significance test rejects a hypothesis/model only if the observed data is highly improbable
when calculating the probability while assuming the hypothesis being tested; the
significance test only gauges whether the observed data likely invalidates the assumed
hypothesis, and cannot decide that the assumption (however unmistakably false)
is likely to be false a priori, without any data.

Essentially,  all  models  are  wrong,  but  some  are  useful.  —  G.  E.  P.  Box 
1 Introduction


As  pointed  out  in  the  above  quotation  of  G.  E.  P.  Box,  many  interesting  models  are  false 
(that  is,  not  exactly  true),  yet  are  useful  nonetheless.  Significance  testing  helps  measure 
the  usefulness  of  a  model.  Testing  the  validity  of  using  a  model  for  virtually  any  purpose 
requires  knowing  whether  observed  discrepancies  are  due  to  inaccuracies  or  inadequacies  in 
the  model  or  (on  the  contrary)  could  be  due  to  chance  arising  from  necessarily  finite  sample 
sizes.  Significance  tests  gauge  whether  the  discrepancy  between  the  model  and  the  observed 
data  is  larger  than  expected  random  fluctuations;  significance  tests  gauge  the  size  of  the 
unavoidable  random  fluctuations. 

A  traditional  approach,  along  with  its  modern  formulation  in  statistical  decision  theory, 
tries  to  decide  whether  a  hypothesized  model  is  likely  to  be  true  (or  false).  However,  in 
many practical circumstances, a significance test need only gauge the consistency (or
inconsistency) of the observed data with the assumed hypothesis/model, without ever enquiring
as to whether the assumption is likely to be true (or false), or whether some alternative is
likely to be true (or false). In this practical formulation, a significance test rejects a
hypothesis/model only if the observed data is highly improbable when calculating the probability
while  assuming  the  hypothesis  being  tested.  Whether  or  not  the  assumption  could  be  exactly 
true  in  reality  is  irrelevant. 

An illustrative example may help clarify. When testing the goodness of fit for the Poisson
regression where the distribution of $Y$ given $x$ is the Poisson distribution of mean
$\exp(\theta^{(0)} + \theta^{(1)} x + \theta^{(2)} x^2 + \theta^{(3)} x^3)$, the conventional Neyman-Pearson null hypothesis is

$H_0^{NP}$ : there exist real numbers $\theta^{(0)}$, $\theta^{(1)}$, $\theta^{(2)}$, $\theta^{(3)}$ such that $y_1, y_2, \dots, y_n$ are independent
draws from the Poisson distributions with means $\mu_1, \mu_2, \dots, \mu_n$, respectively, (1)

where

$\ln(\mu_k) = \theta^{(0)} + \theta^{(1)} x_k + \theta^{(2)} (x_k)^2 + \theta^{(3)} (x_k)^3$ (2)

for $k = 1, 2, \dots, n$, and the observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ are ordered pairs of
scalars (real numbers paired with nonnegative integers). A related, but perhaps simpler null
hypothesis is

$H_0$ : $y_1, y_2, \dots, y_n$ are independent draws from the Poisson distributions
with means $\hat\mu_1, \hat\mu_2, \dots, \hat\mu_n$, respectively, (3)

where

$\ln(\hat\mu_k) = \hat\theta^{(0)} + \hat\theta^{(1)} x_k + \hat\theta^{(2)} (x_k)^2 + \hat\theta^{(3)} (x_k)^3$ (4)

for $k = 1, 2, \dots, n$, with $\hat\theta$ being a maximum-likelihood estimate. Needless to say, even if
the observed data really does arise from Poisson distributions whose means are exponentials
of a cubic polynomial, the particular values $\hat\theta^{(0)}$, $\hat\theta^{(1)}$, $\hat\theta^{(2)}$, $\hat\theta^{(3)}$ of the parameters of the fitted
polynomial will almost surely not be exactly equal to the true values. Even though the
estimated values of the parameters may not be exactly correct, it still makes good sense to
enquire as to whether the fitted cubic polynomial is consistent with the data up to random
fluctuations inherent in using a finite amount of observed data.

In fact, since subsequent use of the model usually involves the particular fitted polynomial
(whose specification includes the observed parameter estimates), analyzing the model
including  the  estimated  values  of  the  parameters  makes  more  sense  than  trying  to  decide 
whether  the  data  really  did  come  from  Poisson  distributions  whose  means  are  exponentials 
of  some  unspecified  cubic  polynomial.  For  instance,  any  plot  of  the  fit  (such  as  a  plot  of  the 
means  of  the  Poisson  distributions)  must  use  the  estimated  values  of  the  parameters,  and 
any  statistical  interpretation  of  the  plot  should  also  depend  explicitly  on  the  estimates; 
a  significance  test  can  gauge  the  consistency  of  the  plotted  fit  with  the  observed  data, 
without  ever  asking  whether  the  plotted  fit  is  the  truth  (it  is  almost  surely  not  identical 
to  the  underlying  reality)  and  without  making  some  decision  about  an  abstract  family  of 
polynomials  which  may  or  may  not  include  both  the  plotted  fit  and  the  underlying  reality. 

A popular measure of divergence from the null hypothesis is the log-likelihood-ratio

$g^2 = 2 \sum_{k=1}^{n} y_k \ln(y_k / \hat\mu_k)$. (5)

A P-value (see, for example, Section 3 below) quantifies whether this divergence is larger than
expected from random fluctuations inherent in using only $n$ data points. It is not obvious how
to calculate an exact P-value for $H_0^{NP}$ from (1) and (2), which refers to cubic polynomials
with undetermined coefficients. In contrast, $H_0$ from (3) and (4) refers explicitly to the
particular fitted value $\hat\theta$; $H_0$ concerns the particular fit displayed in a plot, and is natural for
the statistical interpretation of such a plot.
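
As a minimal sketch of how the divergence $g^2 = 2 \sum_k y_k \ln(y_k / \hat\mu_k)$ might be evaluated once fitted means are available, the following Python function takes the observed counts and the fitted Poisson means as plain lists. The function name `log_likelihood_ratio` is hypothetical, and the convention that a term with $y_k = 0$ contributes zero (the limit of $y \ln y$ as $y \to 0$) is an assumption stated here rather than in the text.

```python
import math


def log_likelihood_ratio(y, mu_hat):
    """Return g^2 = 2 * sum_k y_k * ln(y_k / mu_hat_k).

    y      -- observed nonnegative integer counts y_1, ..., y_n
    mu_hat -- fitted Poisson means (all strictly positive)

    Assumption: a term with y_k == 0 contributes 0 (the limit of y*ln(y)).
    """
    if len(y) != len(mu_hat):
        raise ValueError("y and mu_hat must have the same length")
    total = 0.0
    for y_k, mu_k in zip(y, mu_hat):
        if mu_k <= 0:
            raise ValueError("fitted means must be positive")
        if y_k > 0:
            total += y_k * math.log(y_k / mu_k)
    return 2.0 * total
```

A perfect fit (every $y_k$ equal to its fitted mean) gives a divergence of zero; larger values indicate a larger discrepancy between the data and the fit.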

Thus, when calculating significance, the assumed model should include the particular
values of any parameters estimated from the observed data. Such parameters are known as
“nuisance” parameters. As illustrated with $H_0$ from (3) and (4), the assumed hypothesis
will be “simple” in the Neyman-Pearson sense, but will depend on the observed values of
the parameters; that is, the hypothesis will be “data-dependent”; the hypothesis will be
“random.” Including the particular values of the parameters estimated from the observed
data replaces the “composite” hypothesis of the conventional Neyman-Pearson formulation
with a “simple” data-dependent hypothesis. As discussed in Section 4 below, fully conditional
tests also incorporate the observed values of the parameters, but make the extra assumption
that all possible realizations of the experiment (observed or hypothetical) generate the
same observed values of the parameters. The device of a “simple data-dependent hypothesis”
such as $H_0$ incorporates the observed values explicitly without the extra assumption.

For  most  purposes,  a  parameterized  model  is  not  really  operational  —  that  is,  suitable 
for  making  precise  predictions  —  until  its  specification  is  completed  via  the  inclusion  of 
estimates  for  any  nuisance  parameters.  The  results  of  the  significance  tests  considered  below 
depend  on  the  quality  of  both  the  models  and  the  parameter  estimators.  However,  the  results 
are  relatively  insensitive  to  the  particular  observed  realizations  of  the  parameter  estimators 
(that  is,  to  the  parameter  estimates)  unless  specifically  designed  to  quantify  the  quality  of 
the  parameter  estimates.  To  quantify  the  quality  of  the  parameter  estimates,  we  recommend 
testing  separately  the  goodness  of  fit  of  the  parameter  estimates,  using  confidence  intervals, 
confidence  distributions,  parametric  bootstrapping,  or  significance  tests  within  parametric 
models,  whose  statistical  power  is  focused  against  alternatives  within  the  parametric  family 
constituting  the  model  (for  further  discussion  of  the  latter,  see  Section  5  below). 

The remainder of the present article has the following structure: Section 2 very briefly
discusses Bayesian-frequentist hybrids, referring for details to the definitive work of Gelman
(2003). Section 3 defines P-values (also known as “attained significance levels”), which
quantify  the  consistency  of  the  observed  data  with  the  assumed  models.  Section  4  details 
several  approaches  to  testing  the  goodness  of  fit  for  distributional  profile.  Section  5  discusses 
testing  the  goodness  of  fit  for  various  properties  beyond  just  distributional  profile. 

Cox  (2006)  details  many  advantages  of  interpreting  significance  as  gauging  the  consistency 
of  an  assumption/hypothesis  with  observed  data,  rather  than  as  making  decisions  about  the 
actual  truth  of  the  assumption.  However,  significance  testing  is  meaningless  without  any 
observations,  unlike  purely  Bayesian  methods,  which  can  produce  results  without  any  data, 
courtesy  of  the  prior  (the  prior  is  the  statistician’s  system  of  a  priori  beliefs,  accumulated 
from  prior  experience,  law,  morality,  religion,  etc.,  without  reference  to  the  observed  data). 
Significance  tests  are  deficient  in  this  respect.  Those  interested  in  what  is  to  be  considered 
true in reality and in making decisions more generally should use Bayesian and sequential
(including multilevel) procedures. Significance testing simply gauges the consistency of models
with  observed  data;  generally  significance  testing  alone  cannot  handle  the  truth. 


2  Bayesian  versus  frequentist 

Traditionally,  significance  testing  is  frequentist.  However,  there  exist  Bayesian-frequentist 
hybrids known as “Bayesian P-values”; Gelman (2003) sets forth a particularly appealing
formulation.  Bayesian  P-values  test  the  consistency  of  the  observed  data  with  the  model 
used  together  with  a  prior  for  nuisance  parameters.  In  contrast,  the  P-values  discussed  in 
the  present  paper  test  the  consistency  of  the  observed  data  with  the  model  used  together 
with  a  parameter  estimator.  In  the  Bayesian  formulation,  a  P-value  depends  explicitly  on 
the  choice  of  prior;  in  the  formulation  of  the  present  paper,  a  P-value  depends  explicitly  on 
the  choice  of  parameter  estimator.  Thus,  when  there  are  nuisance  parameters,  the  two  types 
of  P-values  test  slightly  different  hypotheses  and  provide  slightly  different  information;  each 
type  is  ideal  for  its  own  set-up.  Of  course,  if  there  are  no  nuisance  parameters,  then  Bayesian 
P-values  and  the  P-values  discussed  below  are  the  same. 


3  P-values 

A P-value for a hypothesis $H_0$ is a statistic such that, if the P-value is very small, then
we can be confident that the observed data is inconsistent with assuming $H_0$. The P-value
associated with a measure of divergence from a hypothesis $H_0$ is the probability that $D \ge d$,
where $d$ is the divergence between the observed and the expected (with the expectation
following $H_0$ for the observations), and $D$ is the divergence between the simulated and the
expected (with the expectation following $H_0$ for the simulations, and with the simulations
performed assuming $H_0$). When taking the probability that $D \ge d$, we view $D$ as a random
variable, while viewing $d$ as fixed, not random. For example, when testing the goodness of
fit for the model of i.i.d. draws from a probability distribution $p_0(\theta)$, where $\theta$ is a nuisance
parameter that must be estimated from the data, that is, from observations $x_1, x_2, \dots, x_n$,
we use the null hypothesis

$H_0$ : $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$. (6)

The P-value for $H_0$ associated with a divergence $\delta$ is the probability that $D \ge d$, where

• $d = \delta(\hat p, p_0(\hat\theta))$,

• $\hat p$ is the empirical distribution of $x_1, x_2, \dots, x_n$,

• $\hat\theta$ is the parameter estimate obtained from the observed draws $x_1, x_2, \dots, x_n$,

• $D = \delta(\hat P, p_0(\hat\Theta))$,

• $\hat P$ is the empirical distribution of i.i.d. draws $X_1, X_2, \dots, X_n$ from $p_0(\hat\theta)$, and

• $\hat\Theta$ is the parameter estimate obtained from the simulated draws $X_1, X_2, \dots, X_n$.

If the P-value is very small, then we can be confident that the observed data is inconsistent
with assuming $H_0$. Examples of divergences include $\chi^2$ (for categorical data) and the maximum
absolute difference between cumulative distribution functions (for real-valued data).
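
The two example divergences just mentioned can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pearson_chi_square` and `ks_distance` are hypothetical, the $\chi^2$ divergence is taken in its usual Pearson form (sum over bins of squared differences between observed and expected counts, divided by the expected counts), and the model is supplied as a list of bin probabilities or as a callable cumulative distribution function.

```python
def pearson_chi_square(counts, model_probs):
    """Pearson chi-square divergence between observed bin counts and a model.

    counts      -- observed counts per bin
    model_probs -- model probability of each bin (all positive, summing to 1)
    """
    if len(counts) != len(model_probs):
        raise ValueError("counts and model_probs must have the same length")
    n = sum(counts)
    total = 0.0
    for c, p in zip(counts, model_probs):
        expected = n * p
        total += (c - expected) ** 2 / expected
    return total


def ks_distance(sample, cdf):
    """Maximum absolute difference between the empirical CDF and a model CDF.

    sample -- real-valued observations
    cdf    -- callable returning the model cumulative distribution function
    """
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the largest
        # gap near x occurs on one of the two sides of the jump.
        worst = max(worst, abs((i + 1) / n - f), abs(f - i / n))
    return worst
```

For instance, `ks_distance([0.25, 0.75], lambda x: x)` compares two points against the uniform distribution on [0, 1] and returns 0.25.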

Remark 3.1. To compute the P-value assessing the consistency of the experimental data
with assuming $H_0$, we can use Monte-Carlo simulations (very similar to those used by Clauset
et al. (2009)). First, we estimate the parameter $\theta$ from the $n$ given experimental draws,
obtaining $\hat\theta$, and calculate the divergence between the empirical distribution and $p_0(\hat\theta)$. We
then run many simulations. To conduct a single simulation, we perform the following
three-step procedure:

1. we generate $n$ i.i.d. draws according to the model distribution $p_0(\hat\theta)$, where $\hat\theta$ is the
estimate calculated from the experimental data,

2. we estimate the parameter $\theta$ from the data generated in Step 1, obtaining a new
estimate $\tilde\theta$, and

3. we calculate the divergence between the empirical distribution of the data generated in
Step 1 and $p_0(\tilde\theta)$, where $\tilde\theta$ is the estimate calculated in Step 2 from the data generated
in Step 1.

After conducting many such simulations, we may estimate the P-value for assuming $H_0$
as the fraction of the divergences calculated in Step 3 that are greater than or equal to
the divergence calculated from the empirical data. The standard error of the estimated P-value is
inversely proportional to the square root of the number of simulations conducted; for details,
see Remark 3.2 below.
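
The three-step procedure of Remark 3.1 can be sketched generically in Python. Everything concrete below is an illustrative assumption rather than part of the text: the function names, the choice of an exponential distribution as the model $p_0$, the maximum-likelihood rate estimate `len(xs) / sum(xs)`, and the use of the maximum absolute difference between cumulative distribution functions as the divergence.

```python
import math
import random


def ks_distance(sample, cdf):
    """Maximum absolute difference between the empirical CDF and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check both sides of the jump of the empirical CDF at x.
        worst = max(worst, abs((i + 1) / n - f), abs(f - i / n))
    return worst


def exp_estimate(xs):
    """Maximum-likelihood estimate of the rate of an exponential distribution."""
    return len(xs) / sum(xs)


def exp_simulate(theta, n, rng):
    """Generate n i.i.d. draws from the exponential distribution with rate theta."""
    return [rng.expovariate(theta) for _ in range(n)]


def exp_divergence(xs, theta):
    """Divergence between the empirical distribution of xs and the fitted model."""
    return ks_distance(xs, lambda x: 1.0 - math.exp(-theta * x))


def monte_carlo_p_value(data, estimate, simulate, divergence,
                        num_sims=1000, seed=0):
    """Estimate the P-value for the data-dependent simple hypothesis H_0."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimate(data)            # estimate from the observed data
    d = divergence(data, theta_hat)       # observed divergence
    exceed = 0
    for _ in range(num_sims):
        sim = simulate(theta_hat, n, rng)     # Step 1: draw from p_0(theta_hat)
        theta_sim = estimate(sim)             # Step 2: re-estimate the parameter
        big_d = divergence(sim, theta_sim)    # Step 3: simulated divergence
        if big_d >= d:
            exceed += 1
    return exceed / num_sims
```

Calling `monte_carlo_p_value(data, exp_estimate, exp_simulate, exp_divergence)` on a list of positive observations returns the fraction of simulated divergences that are at least as large as the observed one. Re-estimating the parameter inside every simulation is what distinguishes this scheme from one that treats $\hat\theta$ as the known true value.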

Remark 3.2. The standard error of the estimate from Remark 3.1 for an exact P-value $P$
is $\sqrt{P(1-P)/\ell}$, where $\ell$ is the number of Monte-Carlo simulations conducted to produce
the estimate. Indeed, each simulation has probability $P$ of producing a divergence that is
greater than or equal to the divergence corresponding to an exact P-value of $P$. Since the
simulations are all independent, the number of the $\ell$ simulations that produce divergences
greater than or equal to that corresponding to P-value $P$ follows the binomial distribution
with $\ell$ trials and probability $P$ of success in each trial. The standard deviation of the number
of simulations whose divergences are greater than or equal to that corresponding to P-value
$P$ is therefore $\sqrt{\ell P(1-P)}$, and so the standard deviation of the fraction of the simulations
producing such divergences is $\sqrt{P(1-P)/\ell}$. Of course, the fraction itself is the Monte-Carlo
estimate of the exact P-value (we use this estimate in place of the unknown $P$ when
calculating the standard error $\sqrt{P(1-P)/\ell}$).
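
The formula of Remark 3.2 is easy to evaluate directly. The short Python sketch below plugs the Monte-Carlo estimate in place of the unknown exact P-value, as the remark describes; the function name `p_value_standard_error` is hypothetical.

```python
import math


def p_value_standard_error(p_estimate, num_sims):
    """Standard error sqrt(P * (1 - P) / l) of a Monte-Carlo P-value estimate.

    p_estimate -- the Monte-Carlo estimate, used in place of the exact P-value
    num_sims   -- the number l of independent simulations conducted
    """
    if not 0.0 <= p_estimate <= 1.0:
        raise ValueError("p_estimate must lie in [0, 1]")
    if num_sims <= 0:
        raise ValueError("num_sims must be positive")
    return math.sqrt(p_estimate * (1.0 - p_estimate) / num_sims)
```

For example, an estimated P-value of 0.5 obtained from 10,000 simulations has a standard error of 0.005; quadrupling the number of simulations halves the standard error.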

4  Goodness  of  fit  for  distributional  profile 

Given observations $x_1, x_2, \dots, x_n$, we can test the goodness of fit for the model of i.i.d.
draws from a probability distribution $p_0(\theta)$, where $\theta$ is a nuisance parameter, via the null
hypothesis

$H_0$ : $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$
for the particular observed value of $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$. (7)

The Neyman-Pearson formulation considers instead the null hypothesis

$H_0^{NP}$ : there exists a value of $\theta$ such that $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\theta)$. (8)

The fully conditional null hypothesis is

$H_0^{FC}$ : $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$
and $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$ takes the same value in all possible realizations. (9)

That is, whereas $H_0$ supposes that the particular observed realization of the experiment
happened to produce a parameter estimate $\hat\theta$ that is consistent with having drawn the data
from $p_0(\hat\theta)$, $H_0^{FC}$ assumes that every possible realization of the experiment (observed or
hypothetical) produces exactly the same parameter estimate. Few experimental apparatuses
constrain the parameter estimate to always take the same (a priori unknown) value during
repetitions of the experiment, as $H_0^{FC}$ assumes. Assuming $H_0^{FC}$ amounts to conditioning on a
statistic that is minimally sufficient for estimating $\theta$; computing the associated P-values is not
always trivial. Furthermore, the assumption that $H_0^{FC}$ is true seems to be more extreme, a
more substantial departure from $H_0^{NP}$, than $H_0$. Finally, testing the significance of assuming
$H_0$ would seem to be more apropos in practice for applications in which the experimental
design does not enforce that repeated experiments always yield the same value for $p_0(\hat\theta)$. We
cannot recommend the use of $H_0^{FC}$ in general. Unfortunately, $H_0^{NP}$ also presents problems....

If the probability distributions are discrete, there is no obvious means for defining an
exact P-value for $H_0^{NP}$ when $H_0^{NP}$ is false; moreover, any P-value for $H_0^{NP}$ when $H_0^{NP}$ is
true would depend on the correct value of the parameter $\theta$, and the observed data does
not determine this value exactly. The situation may be more favorable when measuring
discrepancies with divergences that are “approximately ancillary” with respect to $\theta$, but
quantifying “approximately” seems to be problematic except in the limit of large numbers of
draws. (Some divergences are asymptotically ancillary in the limit of large numbers of draws,
but this is not especially helpful, as any asymptotically consistent estimator $\hat\theta$ converges to
the correct value in the limit of large numbers of draws; $\theta$ is almost surely known exactly in
the limit of large numbers of draws, so there is no benefit to being independent of $\theta$ in that
limit.) Section 3 of Robins and Wasserman (2000) reviews these and related issues.

Remark 4.1. Romano (1988), Bickel et al. (2006), and others have shown that the P-values
for $H_0$ converge in distribution to the uniform distribution over [0, 1] in the limit of large
numbers of draws, when $H_0^{NP}$ is true. In particular, Romano (1988) proves this convergence
for  a  wide  class  of  divergence  measures. 

Remark  4.2.  The  surveys  of  Agresti  (1992)  and  Agresti  (2001)  discuss  exact  P-values  for 
contingency-tables/cross-tabulations, including criticism of fully conditional P-values. Gelman
(2003) provides further criticism of fully conditional P-values. Ward (2012) numerically
evaluates  the  different  types  of  P-values  for  an  application  in  population  genetics.  Section  4 
of  Bayarri  and  Berger  (2004)  and  the  references  it  cites  discuss  the  menagerie  of  alternative 
P-values  proposed  recently. 

5  Goodness  of  fit  for  various  properties 

For  comparative  purposes,  we  first  review  the  null  hypothesis  of  the  previous  section  for 
testing  the  goodness  of  fit  for  distributional  profile,  namely 

$H_0$ : $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$, (10)

with $\theta$ being the nuisance parameter. The measure of discrepancy for $H_0$ is usually taken to
be a divergence between the empirical distribution $\hat p$ and the model $p_0(\hat\theta)$ (in the continuous
case in one dimension, a common characterization of the empirical distribution is the empirical
cumulative distribution function; in the discrete case, a common characterization of the
empirical distribution is the empirical probability mass function, that is, the set of empirical
proportions). One example for $p_0$ is the Zipf distribution over $m$ bins with parameter $\theta$, a
discrete distribution with the probability mass function

$p_0^{(j)}(\theta) = C_\theta / j^{\theta}$ (11)

for $j = 1, 2, 3, \dots, m$, where the normalization constant is

$C_\theta = 1 / \sum_{j=1}^{m} j^{-\theta}$ (12)

and $\theta$ is a nonnegative real number.
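
As a small illustration of the Zipf distribution over $m$ bins, the Python sketch below computes the normalization constant and the probability mass function, taking the probability of bin $j$ to be proportional to $j^{-\theta}$ and normalized to sum to one. The function name `zipf_pmf` and the list-based return value are choices made for this sketch only.

```python
def zipf_pmf(theta, m):
    """Probability mass function of the Zipf distribution over bins 1, ..., m.

    theta -- nonnegative real power in the power law
    m     -- number of bins
    Returns a list whose entry j-1 is the probability of bin j.
    """
    if theta < 0:
        raise ValueError("theta must be nonnegative")
    if m < 1:
        raise ValueError("m must be at least 1")
    weights = [j ** (-theta) for j in range(1, m + 1)]
    c_theta = 1.0 / sum(weights)  # normalization constant
    return [c_theta * w for w in weights]
```

With `theta = 0` every bin receives probability `1 / m`; larger values of `theta` concentrate more of the mass on the first bins.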

When testing the goodness of fit for parameter estimates, we use the null hypothesis

$H_0'$ : $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\phi_0, \hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$, (13)

with $\theta$ being the nuisance parameter and $\phi$ being the parameter of interest (and with $\phi_0$
being the value of $\phi$ assumed under the model). Please note that $H_0$ and $H_0'$ are actually
equivalent, via the identification $p_0(\hat\theta) = p_0(\phi_0, \hat\theta)$. However, the measure of discrepancy for
$H_0'$ is usually taken to be a divergence between $\hat\phi$ and $\phi_0$ rather than the divergence between
$\hat p$ and $p_0(\hat\theta)$ that is more natural for $H_0$. Also, if $\phi$ is scalar-valued, then confidence intervals,
confidence distributions, and parametric bootstrap distributions are more informative than
a significance test. A significance test is appropriate if $\phi$ is vector-valued. One example
for $p_0$ is the sorted Zipf distribution over $m$ bins with $\theta$ being the power in the power law
and with the maximum-likelihood estimate $\hat\phi$ being a permutation that sorts the bins into
rank-order, that is, $p_0$ is the discrete distribution with the probability mass function

$p_0^{(j)}(\phi, \theta) = C_\theta / (\phi(j))^{\theta}$ (14)

for $j = 1, 2, 3, \dots, m$, where the normalization constant $C_\theta$ is defined in (12) with $\theta$ being
a nonnegative real number, and $\phi$ is a permutation of the numbers $1, 2, \dots, m$. The choice
for $\phi_0$ that is of widest interest in applications is the identity permutation (that is, the
“rearrangement” of the bins that does not permute any bins: $\phi_0(j) = j$ for $j = 1, 2, \dots, m$).
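
The sorted Zipf example can be sketched in Python as follows, assuming that the probability of bin $j$ under a permutation $\phi$ is $C_\theta / (\phi(j))^{\theta}$, so that the identity permutation recovers the ordinary Zipf distribution. The function names are hypothetical, a permutation is represented as a list whose entry `j-1` holds $\phi(j)$, and ties between equal counts are broken by bin index, which is an arbitrary choice made for this sketch.

```python
def sorted_zipf_pmf(phi, theta):
    """Probability of each bin j = 1, ..., m as C_theta / phi(j)**theta.

    phi   -- list with phi[j-1] = phi(j), a permutation of 1, ..., m
    theta -- nonnegative real power in the power law
    """
    m = len(phi)
    if sorted(phi) != list(range(1, m + 1)):
        raise ValueError("phi must be a permutation of 1, ..., m")
    c_theta = 1.0 / sum(j ** (-theta) for j in range(1, m + 1))
    return [c_theta * rank ** (-theta) for rank in phi]


def rank_order_permutation(counts):
    """Permutation that sorts the bins into rank-order by decreasing count.

    The bin with the largest count receives rank 1, the next largest rank 2,
    and so on. Ties are broken by bin index (an assumption of this sketch).
    """
    order = sorted(range(len(counts)), key=lambda j: -counts[j])
    phi = [0] * len(counts)
    for rank, j in enumerate(order, start=1):
        phi[j] = rank
    return phi
```

For counts `[3, 10, 5]`, `rank_order_permutation` returns `[3, 1, 2]`: the second bin is the most populated and so receives rank 1. Comparing such an estimated permutation with the identity permutation is the kind of discrepancy between $\hat\phi$ and $\phi_0$ that the paragraph above describes.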

When testing the goodness of fit for the standard Poisson regression with the distribution
of $Y$ given $x$ being the Poisson distribution of mean $\exp\bigl(\theta^{(0)} + \sum_{j=1}^{m} \theta^{(j)} x^{(j)}\bigr)$, we use the
null hypothesis

$H_0''$ : $y_1, y_2, \dots, y_n$ are independent draws from the Poisson distributions with means
$\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_1^{(j)}\bigr)$, $\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_2^{(j)}\bigr)$, $\dots$, $\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_n^{(j)}\bigr)$,
respectively, (15)

where $\theta$ is the nuisance parameter and $\hat\theta$ is a maximum-likelihood estimate. The measure
of discrepancy for $H_0''$ is usually taken to be the log-likelihood-ratio (also known as the
deviance)

$g^2 = 2 \sum_{k=1}^{n} y_k \ln(y_k / \hat\mu_k)$, (16)

where $\hat\mu_k$ is the mean of the Poisson distribution associated with $y_k$ in $H_0''$, namely,

$\hat\mu_k = \exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_k^{(j)}\bigr)$. (17)

One example is the cubic polynomial

$\ln(\hat\mu_k) = \hat\theta^{(0)} + \hat\theta^{(1)} x_k + \hat\theta^{(2)} (x_k)^2 + \hat\theta^{(3)} (x_k)^3$ (18)

for $k = 1, 2, \dots, n$, which comes from the choice $m = 3$ and

$x_k^{(1)} = x_k$; $x_k^{(2)} = (x_k)^2$; $x_k^{(3)} = (x_k)^3$ (19)

for $k = 1, 2, \dots, n$, given observations as ordered pairs of scalars $(x_1, y_1), (x_2, y_2), \dots,
(x_n, y_n)$. Of course, there are similar formulations for other generalized linear models, such
as those discussed by McCullagh and Nelder (1989).
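
To make the relation between the general regression and its cubic special case concrete, here is a minimal Python sketch that evaluates the fitted Poisson means from an already estimated coefficient vector. It assumes the means take the form $\exp$ of an intercept plus a linear combination of $m$ covariates, and that the cubic case uses the covariates $x_k$, $(x_k)^2$, $(x_k)^3$. The function names are hypothetical, and the sketch does not perform the maximum-likelihood fit itself; the coefficients are taken as given.

```python
import math


def poisson_means(theta_hat, features):
    """Fitted means exp(theta_hat[0] + sum_j theta_hat[j] * features[k][j-1]).

    theta_hat -- list [intercept, coefficient 1, ..., coefficient m]
    features  -- list of n rows, each holding the m covariates of one observation
    """
    means = []
    for row in features:
        if len(row) != len(theta_hat) - 1:
            raise ValueError("each row needs one covariate per non-intercept coefficient")
        linear = theta_hat[0] + sum(t * f for t, f in zip(theta_hat[1:], row))
        means.append(math.exp(linear))
    return means


def cubic_features(x):
    """Covariates for the cubic polynomial: m = 3 with x, x**2, x**3."""
    return [[x_k, x_k ** 2, x_k ** 3] for x_k in x]
```

For example, `poisson_means([0.1, 0.2, 0.3, 0.4], cubic_features([2.0]))` evaluates `exp(0.1 + 0.2*2 + 0.3*4 + 0.4*8)`. The resulting means are the quantities that enter the deviance alongside the observed counts.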